The film industry has been booming since the 'Golden Age of Hollywood' in the 1930s, when people would go to the cinema as often as twice a week. Nowadays, there are hundreds of thousands, if not millions, of movies to watch from the comfort of your own home. The graph below shows the increase in the number of movies released in the United States and Canada over the past 20 years. With such a large pool to choose from, viewers often end up watching movies they do not enjoy, and those movies fail. Most of these failures cost their producers millions of dollars; Disney's "Mars Needs Moms", for example, lost almost $143.4 million.
In order to prevent viewer dissatisfaction and producers' financial loss, we propose a movie success prediction program. Researchers have investigated the success rate of a movie based on budget, release time, and participating actors. A key element of predicting success is the criterion itself: a movie counts as successful when its resulting revenue is larger than twice its budget. With this program, producers can assess a potential movie's prospects before spending their money, and viewers can check whether an upcoming movie is worth watching.
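To make the criterion concrete, here is a minimal Python sketch of how the success label could be derived; the table and column names are assumptions for illustration, not the actual dataset schema.

import pandas as pd

# Hypothetical example rows; in practice these come from the datasets discussed below.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "budget": [100_000_000, 20_000_000],
    "revenue": [250_000_000, 30_000_000],
})

# Success label: revenue must exceed twice the budget.
movies["success"] = (movies["revenue"] > 2 * movies["budget"]).astype(int)
print(movies)  # Movie A -> 1 (successful), Movie B -> 0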
With the abundance of movies in our modern world, there is a wide variety of datasets available. IMDB is one of the most popular online databases with information on movies, their cast, crew, and more. The IMDB dataset available on Kaggle contains only 1,000 movies and does not provide movie budgets, a major feature required by our program, so we will not be using it. Below is a figure of the columns included in the IMDB dataset. The IMDB dataset is 309.77 KB.
Another dataset we found is "The MovieLens Dataset". It contains 62,000 movies, but it lacks our target features such as budget and rating; it is instead used by predictors that rely on user reviews to predict a movie's success. The MovieLens dataset contains 6 CSV files: genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv, and tags.csv. The MovieLens dataset is 1.07 GB.
Below we discuss the model we chose to modify. This model uses its own dataset, collected from The Movie Database (TMDB) API. TMDB consists of 375,377 movies dating from 1884 to 2018; the dataset was cleaned and reduced to suit the model used in the chosen research. We will be using two datasets. The first is the dataset used in the original model: it has 5,000 movies and contains the features required by our program. Below is a figure of the columns included in this dataset. The first dataset is 477 KB.
The second dataset we are using, acquired from Kaggle, is called "The Movies Dataset". It covers over 45,000 movies, is 944 MB, and contains metadata on the movies from the Full MovieLens Dataset. The files we mainly target in this dataset are movies_metadata.csv and credits.csv. movies_metadata.csv consists of 23 columns, the most fundamental being budget, revenue, genre, runtime, and vote_average (rating). credits.csv includes cast information, one of the driving factors behind a movie's success. We believe these two datasets are not enough, and we will acquire larger, more suitable datasets before training our program.
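As a sketch of how these files can be loaded with pandas (the paths assume a local copy of the Kaggle download; the selected columns follow the dataset's documented schema):

import pandas as pd

# low_memory=False avoids mixed-type warnings on movies_metadata.csv.
metadata = pd.read_csv("movies_metadata.csv", low_memory=False)
credits = pd.read_csv("credits.csv")  # columns: cast, crew, id

# Keep the columns most relevant to our predictor.
cols = ["id", "budget", "revenue", "genres", "runtime", "vote_average", "vote_count"]
metadata = metadata[cols]
print(metadata.head())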
The program takes in the movie information as an array and outputs the probability of movie success. If the probability is higher than 50%, this indicates the movie is successful. The movie features it takes as input include the following:
These features are entered into the program and processed; once the program predicts the movie's revenue, its success is determined.
The input is entered into an array as follows:
['budget', 'runtime', 'year', 'vote_average', 'vote_count', 'certification_US_G', 'certification_US_NC-17', 'certification_US_NR', 'certification_US_PG', 'certification_US_PG-13', 'certification_US_R', 'genre_Action', 'genre_Adventure', 'genre_Animation', 'genre_Comedy', 'genre_Crime', 'genre_Documentary', 'genre_Drama', 'genre_Family', 'genre_Fantasy', 'genre_History', 'genre_Horror', 'genre_Music', 'genre_Mystery', 'genre_Romance', 'genre_Science Fiction', 'genre_TV Movie', 'genre_Thriller', 'genre_War', 'genre_Western', 'country_Argentina', 'country_Australia', 'country_Austria', 'country_Belgium', 'country_Brazil', 'country_Canada', 'country_China', 'country_Czech Republic', 'country_Denmark', 'country_Finland', 'country_France', 'country_Germany', 'country_Hong Kong', 'country_Hungary', 'country_India', 'country_Ireland', 'country_Israel', 'country_Italy', 'country_Japan', 'country_Mexico', 'country_Netherlands', 'country_New Zealand', 'country_Norway', 'country_Philippines', 'country_Romania', 'country_Russia', 'country_Singapore', 'country_South Africa', 'country_South Korea', 'country_Spain', 'country_Sweden', 'country_Switzerland', 'country_Thailand', 'country_Turkey', 'country_United Arab Emirates', 'country_United Kingdom', 'country_United States of America']
Input: [[200000, 100, 2012, 6, 40, 0, 0 , 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]
Output: 50.0%
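As a usage sketch, assuming the trained classifier is a scikit-learn model saved with the pickle library (as described in our lessons learned below); the file name is hypothetical:

import pickle

# Load the previously trained scikit-learn classifier.
with open("movie_success_model.pkl", "rb") as f:  # hypothetical file name
    model = pickle.load(f)

# The 67-element feature vector shown above; zeros stand in here for the
# one-hot certification/genre/country flags.
x = [[200000, 100, 2012, 6, 40] + [0] * 62]
proba = model.predict_proba(x)[0][1]  # probability of the "successful" class
print(f"Success probability: {proba:.1%}")  # above 50% means predicted success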
This research predicts a movie's success from its revenue: a movie is considered successful if its revenue is twice its budget. The research predicts this target using four different algorithms: logistic regression, K-nearest neighbors, decision tree, and random forest, each with the same predictors and target value. The hyper-parameters are chosen with sklearn's grid search to find the best accuracy, using k-fold cross-validation with k = 10.
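A sketch of that setup in scikit-learn follows; the parameter grids are illustrative, not the exact ones from the research, and the generated data stands in for the real feature matrix.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the real pipeline X holds the 67 predictors and y the success label.
X, y = make_classification(n_samples=1000, n_features=67, random_state=0)

models = {
    "logistic regression": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    "k-nearest neighbors": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 11]}),
    "decision tree": (DecisionTreeClassifier(), {"max_depth": [3, 5, 7, None]}),
    "random forest": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
}

for name, (estimator, grid) in models.items():
    search = GridSearchCV(estimator, grid, cv=10, scoring="accuracy")  # k-fold with k=10
    search.fit(X, y)
    print(name, search.best_score_, search.best_params_)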
The author of this research accessed The Movie Database (TMDB) API [6] using the tmdbsimple library. This database consists of 375,377 movies with 16 fields of information each. However, 75% of the movie records were dropped because they were missing the revenue and the budget, which are essential features of the model.
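For reference, accessing TMDB with tmdbsimple looks roughly like this; the API key and movie id are placeholders, and the filter mirrors the reason most records were dropped.

import tmdbsimple as tmdb

tmdb.API_KEY = "YOUR_TMDB_API_KEY"  # placeholder; requires a free TMDB account

movie = tmdb.Movies(550)  # 550 is an example TMDB movie id
info = movie.info()  # returns a dict including budget and revenue

# Records with a missing or zero budget/revenue cannot be labeled, which is
# why roughly 75% of the raw data had to be dropped.
if info["budget"] > 0 and info["revenue"] > 0:
    print(info["title"], info["budget"], info["revenue"])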
Below are the results of the original model. Logistic regression showed the best result, with 70.3% accuracy.
The initial model achieved an accuracy of 70.3% with logistic regression.
We tried using grid search with an MLP classifier to find the best hyper-parameters, experimenting with different hidden-layer configurations, loss settings, and related options. However, the resulting accuracy was lower than that of the initial logistic regression model.
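A sketch of that experiment; the grid values are illustrative, and X and y are as in the earlier sketch. Note that MLPClassifier's loss is fixed (log-loss), so in practice only architecture-related options can be varied.

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Illustrative grid over network shape, activation, and regularization.
param_grid = {
    "hidden_layer_sizes": [(32,), (64,), (64, 32)],
    "activation": ["relu", "tanh"],
    "alpha": [1e-4, 1e-3],
}
search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)  # X, y as defined in the earlier sketch
print(search.best_score_, search.best_params_)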
We also experimented with a CNN but faced errors running the model. After attempting to write the CNN code, we concluded that it is not suitable for the problem we want to solve and would probably yield lower accuracy.
Ensemble learning is a general approach to machine learning that seeks better predictive performance by combining the predictions from multiple models.
We tried a decision tree classifier, experimenting with different depths. The best accuracy we achieved was 72.6%, an improvement of 2.3 percentage points over the original model's 70.3%.
We tried a random forest classifier, again experimenting with different depths. The best accuracy we achieved was 72.9%, an improvement of 2.6 percentage points over the original model.
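Both sweeps follow the same pattern; a sketch, with an illustrative depth range and X, y as before:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Sweep max_depth for both classifiers and compare 10-fold CV accuracy.
for depth in range(2, 16):
    tree_acc = cross_val_score(DecisionTreeClassifier(max_depth=depth), X, y, cv=10).mean()
    forest_acc = cross_val_score(
        RandomForestClassifier(max_depth=depth, n_estimators=200), X, y, cv=10
    ).mean()
    print(f"depth={depth}: tree={tree_acc:.3f}, forest={forest_acc:.3f}")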
We also experimented with the boosting learning rate. The best accuracy, 73.166%, was obtained with the random forest base estimator and a learning rate of 0.0000001.
Since AdaBoost with a decision tree base estimator gave appealing results, we used GridSearchCV to find the best hyper-parameters. This resulted in a test score of 81.3%.
Hyper-parameters = { 'base_estimator__criterion': 'entropy', 'base_estimator__max_depth': 5, 'base_estimator__max_features': None, 'base_estimator__min_samples_split': 9, 'base_estimator__splitter': 'best', 'learning_rate': 0.1, 'n_estimators': 2 }
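A sketch of that search follows. The grid values beyond the reported best configuration are illustrative, and the base_estimator__ prefix matches scikit-learn versions before 1.2 (newer versions renamed the parameter to estimator).

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())
param_grid = {
    "base_estimator__criterion": ["gini", "entropy"],
    "base_estimator__max_depth": [3, 5, 7],
    "base_estimator__min_samples_split": [2, 5, 9],
    "learning_rate": [0.01, 0.1, 1.0],
    "n_estimators": [2, 10, 50],
}
search = GridSearchCV(ada, param_grid, cv=10, scoring="accuracy")
search.fit(X, y)  # X, y as before
print(search.best_score_, search.best_params_)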
This is the highest accuracy we achieved across all our updates.
The best result we achieved is 81.3%, compared to the original model's 70.3%. We successfully increased the accuracy by around 11 percentage points using the AdaBoost classifier with grid search and a decision tree.
In conclusion, we used the decision tree classifier along with AdaBoost and grid search and successfully enhanced accuracy from 70.3% to 81.3%.
Our future plans for this project include:
This project added to our technical skills. We learned how to train and debug a model, how to save a model using the pickle library, and how to use grid search with different models to tune hyper-parameters. We were also introduced to MLPClassifier and learned how to build an ensemble of models such as AdaBoost.
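For completeness, the pickle save/load step looks like this; the file name is hypothetical, and search is the fitted GridSearchCV object from the sketch above.

import pickle

# Save the best estimator found during the grid search...
with open("movie_success_model.pkl", "wb") as f:
    pickle.dump(search.best_estimator_, f)

# ...and reload it later for predictions.
with open("movie_success_model.pkl", "rb") as f:
    model = pickle.load(f)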